Abstract

Protein metabolism is one of the most costly processes in the cell and is therefore expected to be under the effective control of natural 
selection. We stimulated yeast strains to overexpress each single gene product to approximately 1% of the total protein content. 
Consistent with previous reports, we found that excessive expression of proteins containing disordered or membrane-protruding 
regions resulted in an especially high fitness cost. We estimated these costs to be nearly twice as high as for other proteins. There was a 
ten-fold difference in cost if, instead of entire proteins, only the disordered or membrane-embedded regions were compared with 
other segments. Although the cost of processing bulk protein was measurable, it could not be explained by several tested protein 
features, including those linked to translational efficiency or intensity of physical interactions after maturation. It most likely included a 
number of individually indiscernible effects arising during protein synthesis, maturation, maintenance, (mal)functioning, and disposal. 
When scaled to the levels normally achieved by proteins in the cell, the fitness cost of dealing with one amino acid in a standard protein 
appears to be generally very low. Many single amino acid additions or deletions are likely to be neutral even if the effective population 
size is as large as that of the budding yeast. This should also apply to substitutions. Selection is much more likely to operate if point 
mutations affect protein structure by, for example, extending or creating stretches that tend to unfold or interact improperly with 
membranes. 
Introduction

Proteins constitute a major component of the dry mass of a 
cell. Synthesis of amino acids and subsequent assembly of 
polypeptides are costly. The two processes are estimated to 
consume about one-half of the ATP molecules in a growing 
yeast cell and involve a large fraction of its nucleic acids and 
ribosomal proteins (Verduyn 1991; Warner 1999). The huge 
cost of protein synthesis has been recognized as such for de- 
cades (Maaloe and Kjeldgaard 1966; Waldron and Lacroute 
1975). More recently, it has been shown that newly assem- 
bled polypeptides are released into a crowded environment of 
macromolecules in which their folding is easily derailed (Ellis 
2001). They often end up in a form that is not only unproduc- 
tive but can also be toxic and sometimes resistant to degra- 
dation (Stefani and Dobson 2003; Winklhofer et al. 2008). 
However, while it is certain that the costs and risks associated 
with the turnover of the total protein load are large, it remains 
unknown how much individual protein species differ in this 
respect. In theory, it is possible to calculate the cost of protein 
synthesis because the substrates and the process are well 
known. However, the required parameters are many and 
they have not yet been estimated with sufficient accuracy 
(von der Haar 2008; Siwiak and Zielenkiewicz 2010). 
Because the routes of folding and degradation for different 
polypeptides are still being worked out, the energy or fitness costs 
associated with such events are presently impossible to 
assess (Hartl et al. 2011). Thus, it remains a great challenge 
in current research to provide analytical, experimental, or com- 
putational estimates of selective pressures acting on individual 
proteins. 

Evidence that different proteins experience different selec- 
tive forces on traits other than their primary functions can be 
extracted from the DNA sequence. In particular, it is well es- 
tablished that the rate of molecular evolution differs widely 
between genes and that those expressed the most are the 
ones that change the least (Sharp 1991; Pal et al. 2001). 
One explanation could be that the highly expressed genes 
mutate at a lower rate, a possibility that has gained some 
support recently (Martincorena et al. 2012). Most researchers, 
however, believe that more highly expressed genes are under 
stronger purifying selection. Some of the tentative explana- 
tions invoke functional arguments: importance (essentiality) of 
function, multiplicity of functions, centrality to metabolic net- 
works, number of transcription factors assisting expression, or 
enrichment for genetic and/or physical interactions (Fraser 
et al. 2002; Jordan et al. 2003; Bloom and Adami 2004; 
Wall et al. 2005; Pal et al. 2006; Vitkup et al. 2006; Xia 
et al. 2009). For each of these factors, however, correlation 
with the rate of evolution is much lower than that for the level 
of gene expression (Rocha 2006; Wang and Zhang 2009). 
Thus, it appears that it is the amount of protein product 
that matters most. This could mean that selection tends to 
purge mutations located in highly expressed genes because 
they lead to a greater waste of resources (Barton et al. 2010; 
Vieira-Silva et al. 2011). Not only efficient use of materials and 
energy but also a high rate of translation can be important. 
This could result in selection for optimal codon usage in the 
highly expressed genes (Akashi 2001; Plotkin and Kudla 
2010). The more protein molecules, the higher the toxic 
effect after misfolding; therefore, misfolding-resistant se- 
quences should especially be preserved in highly expressed 
genes, which would constrain their evolution (Drummond 
et al. 2005; Drummond and Wilke 2008; Yang et al. 2010). 
In sum, there is no lack of hypotheses for how the amount 
of synthesized protein could dictate the rate of molecular 
evolution. However, these hypotheses have been conceived 
through comparative analyses of DNA/protein sequences and 
have been verified mostly in the same way. In this article, we 
report the results of a study aimed at testing these hypotheses 
experimentally, an approach that has so far been taken by only a few 
researchers. 

The postulate of controlled alteration of selected determi- 
nants of the protein production cost has proved difficult to 
implement. For example, changing the actual codon usage to 
a devised one alters the stability and hence the abundance of 
the resulting mRNA variants. The effect of mRNA abundance 
can be more important than the sought effect of mRNA com- 
position (Kudla et al. 2009; Agashe et al. 2013). Even the 
seemingly straightforward task of demonstrating that over- 
production of unnecessary proteins is disadvantageous has 
proved challenging. There must be costs associated with syn- 
thesis of redundant polypeptides, but there are also costs of 
their presence in the cell and their interactions with cell struc- 
tures (Stoebel et al. 2008; Plata et al. 2010; Eames and 
Kortemme 2012). Our approach is based on the assumption 
that universal costs of protein expression do exist and can be 
at least partly disentangled if the number and diversity of an- 
alyzed proteins are sufficiently large. We relied on a genomic 
collection of yeast strains, each overexpressing a single pro- 
tein. Two previous studies measured approximately how 
much protein was overproduced and categorized the 
growth effects accompanying this overproduction (Gelperin 
et al. 2005; Sopko et al. 2006). One experiment measured 
fitness using a quantitative assay, but the level of production 
was not estimated, and the average production could not be 
calculated because the applied protocol of overexpression differed 
from those used earlier (Yoshikawa et al. 2011). We therefore 
carried out our own assays in which we stimulated genes to 
moderate protein overproduction, measured overexpressed 
protein levels quantitatively, and estimated the growth rate 
with high accuracy. 

We first examined our data by asking whether the fitness 
effect of overexpression was heavily dependent on the cellular 
role of a tested gene. It was not, as we found by reviewing 
gene annotations. This was encouraging because we could 
assume that the effect of metabolic deregulation would not 
obscure the effect of carrying useless or toxic protein mole- 
cules. We thus asked which of the several protein properties 
could be the best predictor of fitness variation. We confirmed 
previous reports showing that proteins containing transmem- 
brane (Kitagawa et al. 2006; Osterberg et al. 2006) and dis- 
ordered (Vavouri et al. 2009; Ma et al. 2010) regions are 
especially costly to fitness when overexpressed. Crucially, we 
quantitatively compared these costs with the cost of express- 
ing normal (well-structured cytosolic) proteins. We found that 
the cost of expressing well-structured cytosolic proteins is very 
low when scaled to one amino acid addition (and thus also 
substitution). 

Materials and Methods 

Strains 

We used a previously constructed collection of single yeast 
open reading frames (ORFs), each with the same inducible 
promoter P_GAL1 followed by the same tandem affinity tag 
(His6, HA epitope, protease 3C site, ZZ domain, 19 kDa) 
cloned into a multicopy plasmid (Gelperin et al. 2005). 
Plasmids were hosted by the haploid yeast strain Y258. 
Most of the cloned genes had been tested for errors; only 
approximately 3% of them were likely to have an undetected 
mutation (Gelperin et al. 2005). 

Fitness Assays 

The overexpression strains were inoculated directly from 
plates shipped by the distributor (Open Biosystems) into 
200 µl of SC with glucose but lacking uracil to stabilize the 
plasmid. To stimulate overexpression, we used synthetic com- 
plete (SC) with raffinose as a source of carbon and galactose 
as an inducer, according to a protocol described in the original 
study that led to moderate overexpression. We then trans- 
ferred 10 µl aliquots of each culture into 190 µl of fresh glu- 
cose medium and incubated for 48 h. From these cultures, 
10 µl aliquots were transferred to 135 µl of SC with raffinose for 
another 48 h. The raffinose cultures were diluted ten times 
and the optical densities (ODs) measured. These cell 
suspensions were diluted again at 1:50 in SC with raffinose 
and galactose (2% each). In this growth/induction medium, 
the cultures were allowed to grow for 20 h, at which point 
their ODs were determined. The ratio of the two OD mea- 
surements, which were corrected for the dilution factor, 
served to calculate the number of cell doublings for each cul- 
ture. All growth assays were carried out at 30 °C. 
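
The doubling count follows from the two OD readings once the dilution between them is taken into account. The sketch below is a minimal illustration, assuming that the 1:50 dilution means a 50-fold reduction in cell density and that both readings lie in the linear range of the instrument. The function name and the example readings are hypothetical and are not taken from the study.

```python
import math

def cell_doublings(od_before, od_after, dilution_factor=50.0):
    """Number of doublings between inoculation and the final reading.

    od_before: OD of the suspension measured just before it was diluted
               into the growth/induction medium.
    od_after:  OD of the culture at the end of the growth period.
    dilution_factor: fold dilution applied between the first reading and
               the start of growth (assumed to be 50 for a 1:50 dilution).
    """
    if od_before <= 0 or od_after <= 0 or dilution_factor <= 0:
        raise ValueError("ODs and dilution factor must be positive")
    starting_od = od_before / dilution_factor  # density at time zero
    return math.log2(od_after / starting_od)

# Hypothetical readings: OD 0.50 before dilution, OD 1.00 after growth.
example = cell_doublings(0.50, 1.00)
print(round(example, 2))
```

If the final culture is itself diluted before reading, the measured value would first have to be multiplied by that dilution, which is the correction the text refers to.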

Protein Assays 

Overproduction of proteins was induced by transferring cells 
sequentially from glucose to raffinose, and then to raffinose/ 
galactose medium for 8 h. The cells were then centrifuged, 
washed with ice-cold water, and frozen. To extract proteins, 
the cells were beaten with glass beads in 100 µl of lysis buffer 
(50 mM Tris-HCl, pH 7.5, 0.5% sodium dodecyl sulphate, 
0.1 mM ethylenediaminetetraacetic acid, protease inhibitors) 
for 4 h at 4°C. Cell remnants were then spun down, and the 
supernatants were collected. Total protein content was deter- 
mined using a bicinchoninic acid (BCA) protein assay. For a 
competitive ELISA assay, plates were coated overnight at 4°C 
with 0.05 µl of normal rabbit serum (Pierce) diluted in 100 µl 
of 0.2 M carbonate-bicarbonate buffer, pH 9.4. After wash- 
ing, plates were blocked with 300 µl of 2% bovine serum 
albumin (BSA) for 24 h. The yeast protein extracts were 
mixed with protein A conjugated to peroxidase (Pierce), then 
100 µl of the resulting mixture was added to the blocked plate 
wells, for a total of 10 µg of total yeast protein and 25 ng 
(-26 nil) of protein A per well. After 1 h of incubation, the 
mixtures were discarded and the wells washed and filled with 
100 µl of the 3,3',5,5'-tetramethylbenzidine (TMB) substrate. 
The reaction was terminated after 30 min with 100 µl of 2 M 
H2SO4, and then the absorbance at 450 nm was measured. 
All washing steps were performed with 200 µl of phosphate- 
buffered saline containing 0.05% Tween 20. One of the 
tagged proteins (Ade2p) was purified, diluted into a gradient 
of known concentrations, and used as a standard to calibrate 
the reads. 
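
Reading an unknown sample against a dilution series of the purified standard can be sketched as follows. This is only an illustration under stated assumptions: the standard amounts and absorbances are invented, the curve is interpolated linearly on a log10 scale of amount, and the conversion between molar and mass units that a real calibration would need is ignored. In a competitive assay the signal falls as the amount of tagged protein rises, so the standards are sorted by absorbance before interpolation.

```python
import math

def amount_from_absorbance(absorbance, std_absorbance, std_amount_ng):
    """Interpolate the amount of tagged protein (ng) from a standard curve.

    Interpolation is linear in absorbance against log10(amount).
    """
    points = sorted(zip(std_absorbance, (math.log10(a) for a in std_amount_ng)))
    if not points[0][0] <= absorbance <= points[-1][0]:
        raise ValueError("absorbance outside the calibrated range")
    for (a0, l0), (a1, l1) in zip(points, points[1:]):
        if a0 <= absorbance <= a1:
            t = (absorbance - a0) / (a1 - a0)
            return 10 ** (l0 + t * (l1 - l0))

def percent_of_total(amount_ng, total_protein_ug=10.0):
    """Tagged protein as a percentage of the total protein loaded per well."""
    return 100.0 * amount_ng / (total_protein_ug * 1000.0)

# Hypothetical standard curve: more standard protein gives less signal.
STD_NG = [1.0, 10.0, 100.0, 1000.0]
STD_ABS = [1.8, 1.4, 0.8, 0.3]

sample_ng = amount_from_absorbance(0.8, STD_ABS, STD_NG)
sample_percent = percent_of_total(sample_ng)
print(sample_ng, sample_percent)
```

The default of 10 µg of total protein per well matches the loading stated above; everything else in the example is a placeholder.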

Gene Ontology and Protein Properties 

To analyze the GO categories (Saccharomyces Genome 
Database [SGD]), we applied an ANOVA model in which 
each of the 5,084 overexpressed genes was described by 
the Yeast Slim categories taking values of zero or one 
(absent or present). We used the "lm" function of the R pack- 
age, followed by the "step" function (based on Akaike 
Information Criterion [AIC]) to reduce the number of pre- 
dictor variables by eliminating the nonsignificant ones (R 
Development Core Team 2010). The analyses were performed 
separately for the molecular function, cellular component, and 
biological process classifications. As these classifications con- 
tained tens of terms, we did not analyze interactions between 
them because the latter were very numerous and usually con- 
tained too few data points to be meaningful. 
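
The logic of fitting a linear model on 0/1 category indicators and then pruning it by AIC can be illustrated outside R. The sketch below is not the authors' code: it assumes plain backward elimination, uses the form AIC = n ln(RSS/n) + 2k (equal to the full AIC up to an additive constant), solves the least-squares problem through the normal equations, and runs on synthetic data in which only the first of four hypothetical categories affects fitness. It also assumes that the indicator columns are not collinear.

```python
import math
import random

def residual_ss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for s in range(col, p + 1):
                M[r][s] -= f * M[col][s]
    beta = [0.0] * p
    for r in reversed(range(p)):  # back substitution
        beta[r] = (M[r][p] - sum(M[r][s] * beta[s] for s in range(r + 1, p))) / M[r][r]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2 for i in range(n))

def aic(rows, y, cols):
    """AIC up to an additive constant: n*ln(RSS/n) + 2*(number of coefficients)."""
    X = [[1.0] + [float(row[j]) for j in cols] for row in rows]
    n = len(y)
    return n * math.log(residual_ss(X, y) / n) + 2 * (len(cols) + 1)

def backward_aic(rows, y):
    """Drop one indicator at a time for as long as doing so lowers the AIC."""
    cols = list(range(len(rows[0])))
    best = aic(rows, y, cols)
    while cols:
        score, drop = min((aic(rows, y, [c for c in cols if c != d]), d) for d in cols)
        if score >= best:
            break
        best, cols = score, [c for c in cols if c != drop]
    return cols

# Synthetic data (hypothetical): four 0/1 categories, only category 0 lowers fitness.
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(4)] for _ in range(200)]
fitness = [1.0 - 0.05 * row[0] + rng.gauss(0.0, 0.01) for row in rows]
kept = backward_aic(rows, fitness)
print(kept)
```

A criterion of this kind can retain a category that is not significant by a conventional test, so the surviving predictors are best read as candidates rather than as confirmed effects.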



Protein properties were analyzed by implementing a mul- 
tiple regression model using the "lm" function. Continuous 
predictor variables were log-transformed (except for gravy 
score and mRNA 5' folding energy); a small constant was 
added to those with zero values before transformation (Wall 
et al. 2005). The continuous predictor variables included: 
mRNA abundance (Garcia-Martinez et al. 2004), protein 
half-life (Belle et al. 2006), intrinsic disorder/protein length + 
0.01 (Linding et al. 2003), protein length (SGD), CAI+0.1 
(SGD), gravy score (SGD), and protein abundance, that 
is, the number of molecules per protein species 
(Ghaemmaghami et al. 2003). To calculate the energy of 
structures at the 5'-end of mRNAs, we used the Vienna 
RNA Package 2.0 (Lorenz et al. 2011) for stretches extending 
from the -4 to +37 nucleotide positions (Plotkin and Kudla 
2010). All continuous predictor variables were standardized 
prior to analysis. There were also two categorical variables: 
physical interaction status (not hub, intermediate number of 
interactions, party hub, and date hub) (Han et al. 2004; Ekman 
et al. 2006) and the presence of transmembrane segments 
(not predicted, predicted by only one study, and predicted by 
two studies) (Persson and Argos 1994; Krogh et al. 2001). 
ORFs with missing values in any of the predictor variables 
were excluded from this analysis. There were 2,913 ORFs 
with a complete set of predictors, and only those were in- 
cluded in the final orthogonal model. We included all ten 
listed variables in the model and the first order interactions 
between them (except for interactions between the two cat- 
egorical variables). The entire procedure was repeated 40 
times with random permutations of the order of categories 
in the model. The P values for predictor variables were aver- 
aged over repeats (geometrically). 
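
Two of the steps above are simple enough to show directly: the log transformation with a small offset followed by standardization, and the geometric averaging of P values over repeated fits. The sketch is illustrative only; the function names and the input values are hypothetical, and the offset of 0.01 is borrowed from the disorder variable merely as an example. Because the variables are standardized after transformation, the base of the logarithm does not affect the result.

```python
import math
import statistics

def log_standardize(values, offset=0.0):
    """Log-transform (after adding an offset) and scale to mean 0, SD 1."""
    logged = [math.log(v + offset) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]

def geometric_mean_p(p_values):
    """Geometric mean of P values, computed on the log scale for stability."""
    return math.exp(sum(math.log(p) for p in p_values) / len(p_values))

# Hypothetical predictor containing a zero, hence the small offset.
z_scores = log_standardize([0.0, 0.05, 0.2, 0.6], offset=0.01)

# Hypothetical P values for one predictor from two permuted model orders.
pooled_p = geometric_mean_p([0.01, 0.0001])
print(z_scores, pooled_p)
```

Averaging on the log scale keeps a single very small P value from being swamped by larger ones, which is the practical difference from an arithmetic mean.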

Results 

Fitness Effects of Moderate Overexpression of Genes 
Are Small 

We found that an overproduced protein species constituted 
typically approximately 1% of the total protein amount (more 
detailed data reported later), which is much less than doses 
known to be severely toxic (Dong et al. 1995; Geiler- 
Samerotte et al. 2011). We measured fitness by estimating 
how many cell divisions occurred in single-strain liquid cultures 
over a period of about 1 day (see Materials and Methods). This 
included both lag and growth phases resulting in an average 
number of doublings of 7.75 (median 7.83) with a standard 
deviation of 0.45. (The cultures reached about one-fourth of 
their final density.) Thus, variation in fitness was not high, 
especially given that a sizable portion of it came from differ- 
ences between plates and was eliminated from all subsequent 
analyses by within-plate normalization (see Materials and 
Methods). Previous studies evaluated the growth of colonies 
on common agar plates (Gelperin et al. 2005; Sopko et al. 
2006) or in individual liquid cultures over a shorter time inter- 
val (Yoshikawa et al. 2011; Makanae et al. 2013). Those ear- 
lier estimates generally agree with ours (supplementary fig. S1, 
Supplementary Material online). We sought to assay fitness in 
a way that would increase the role of fast growth, and thus 
fast protein processing, in the final measure of fitness. 
Importantly, we wanted to compare quantitative fitness esti- 
mates with quantitative estimates of protein overproduction 
for a large number of individual clones, which had not been 
performed in previous studies. 
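
The normalization mentioned above amounts to dividing each doubling estimate by the median of its own group, so that a typical strain scores 1. A minimal sketch, assuming the group is a single plate within one replicate; the plate and strain labels and the numbers are hypothetical:

```python
import statistics

def normalize_by_median(doublings_by_plate):
    """Divide every estimate by the median of the plate it came from."""
    normalized = {}
    for plate, values in doublings_by_plate.items():
        plate_median = statistics.median(values.values())
        normalized[plate] = {
            strain: value / plate_median for strain, value in values.items()
        }
    return normalized

# Hypothetical raw doubling counts on two plates.
raw = {
    "plate1": {"strainA": 7.0, "strainB": 8.0, "strainC": 9.0},
    "plate2": {"strainD": 6.0, "strainE": 6.5, "strainF": 7.0},
}
relative_fitness = normalize_by_median(raw)
print(relative_fitness)
```

The median is used rather than the mean so that a few very slow strains on a plate do not shift the reference point.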

Figure 1 shows the distribution of normalized fitness esti- 
mates for 5,182 strains containing a unique cloned ORF 
known to express a protein (SGD). The intraclass correlation 
coefficient (ICC) calculated over four independent repeats was 
0.966, indicating that repeatability of our fitness measure- 
ments was high. Good repeatability within a strain and large 
differences between strains (the shape of clouds) suggest that 
factors other than measurement errors were responsible for 
much of the fitness variation. Some factors, such as the aver- 
age copy number of individual plasmids, could not be con- 
trolled in this experimental system. All individual records, both 
normalized and nonnormalized, are listed in supplementary 
table S1, Supplementary Material online. 
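
The text does not state which form of the ICC was used. As an illustration, the one-way random-effects version can be computed from the between-strain and within-strain mean squares as follows; the choice of this form and the example values are assumptions.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for equally sized groups of repeats.

    groups: list of lists, one inner list of repeated estimates per strain.
    """
    n = len(groups)        # number of strains
    k = len(groups[0])     # repeats per strain
    if any(len(g) != k for g in groups):
        raise ValueError("all strains need the same number of repeats")
    grand = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical normalized fitness, four repeats for each of three strains.
repeats = [
    [0.98, 0.99, 0.98, 0.99],
    [1.02, 1.03, 1.02, 1.03],
    [0.90, 0.91, 0.90, 0.91],
]
icc_value = icc_oneway(repeats)
print(round(icc_value, 3))
```

Values close to 1 mean that almost all of the variance lies between strains rather than between repeats of the same strain.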

Functional Categorization Explains Little of the Gene 
Overexpression Effects 

As reported later in detail, the median content of overex- 
pressed proteins was approximately 400 times higher than 
the median content of normally expressed ones 
(Ghaemmaghami et al. 2003). This could potentially disturb 
at least some cellular functions. The overexpressed genes fell 
into 22 Yeast Slim GO cell component categories, 41 molec- 
ular function categories, and 100 biological process categories 
(we decided to reduce the biological process categories to 40 
by combining some of the most similar ones). Within each of 
these three classifications, we first applied a linear model in- 
cluding all categories and then progressively simplified it by 
eliminating statistically nonsignificant categories (see Materials 
and Methods). We obtained a relatively low number of po- 
tentially important predictors shown in figure 2. There were a 
few categories associated with increased fitness. These sug- 
gest that speeding up turnover of nucleotides and adjusting 
oxidative metabolism could have a positive effect on fitness. 
Negative effects were more numerous and larger. They were 
linked to cell wall and membrane structures. Although these 
factors were significant on a statistical level, they had very 
small average effects, approximately 0.005, which is clearly 
less than the standard deviation of the overall distribution of 
normalized fitness estimates, 0.032 (fig. 1b). The observed 
weak dependence of fitness effects on the functions of the 
overexpressed proteins may be specific to our experimental 
system. Other arrangements, for example, Escherichia coli and 
high overexpression, have shown that unnaturally high levels 
of transcription factors and regulatory proteins can be toxic 
(Singh and Dash 2013). 

Fig. 1. The effects of single gene overexpression on growth. The 
number of cell divisions in single-strain cultures was estimated four times 
independently. The estimates were divided by the median values of rele- 
vant replications to obtain normalized values. (a) The repeatability of the 
individual normalized fitness estimates and (b) the frequency distribution 
of strains' means. The vertical dashed line marks the slowest growing 91 
strains. These were removed from all of the following statistical analyses to 
make the distribution symmetric and closer to normal. (This exclusion was 
unlikely to affect our analyses. For example, we correlated fitness with ten 
properties of proteins for all data and those lacking the 77 data points. For 
data analyzed in this way, pairs of Pearson's coefficients were themselves 
very much correlated: Pearson's r = 0.988, Spearman's r_s = 1.) 

To further test whether growth was indeed relatively insen- 
sitive to metabolic deregulation, we focused our analyses on 
enzymes alone. We revisited a study in which the molecular 
evolution of enzymes was considered dependent on their 
metabolic centrality and connectivity (Vitkup et al. 2006). 
Connectivity of an enzyme had been calculated as the 
number of other metabolic enzymes that produce or consume 
the enzyme's products or reactants. In our data set, 329 of the 
350 enzymes examined in the original study were included. 
We used the same categorization of metabolic connectivity 
but did not find it helpful in explaining the observed variation 
in the fitness response to gene overexpression (r = -0.029, 
P = 0.6). Apparently, the cell's metabolic network is well buff- 
ered against perturbations in the expression level of participat- 
ing enzymes, at least when single enzymes are overabundant. 
As reported earlier, most cellular structures and processes 
were also remarkably resistant to such alterations. We there- 
fore decided that it would be acceptable to execute the anal- 
ysis of protein properties for all genes together, ignoring their 
cellular roles and making the statistics both simpler and more 
powerful. 

Fig. 2. Gene Ontology categories as predictors of the overexpression 
cost. The graph shows the highest and most statistically significant devia- 
tions of the Yeast Slim category means from the grand mean (not fitness 
gains or losses when compared with a strain with no overexpression). 
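
The reported lack of association with metabolic connectivity can be checked with a quick calculation from the correlation coefficient. The sketch assumes that all 329 enzymes entered the test (some may have been excluded as slow growers) and replaces the Student t distribution with a normal approximation, which is adequate at this sample size.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """t statistic and approximate two-sided P value for Pearson's r."""
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))  # normal approximation to t
    return t, p

# r is taken from the text; n = 329 is an assumption.
t_stat, p_val = correlation_p_value(-0.029, 329)
print(round(t_stat, 3), round(p_val, 2))
```

Under these assumptions the P value comes out close to the reported 0.6, so a correlation this weak would not be distinguishable from zero even with several hundred enzymes.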

Only a Few Protein Properties Correlate with the Cost of 
Overexpression 

A review of theoretical and empirical studies disclosed ten 
properties of proteins/mRNAs that were frequently examined 
as factors potentially affecting the rate of evolution. The de- 
pendence of fitness on the most significant factors is shown in 
figure 3a. The remaining factors are presented in supplemen- 
tary figure S2, Supplementary Material online. These graphs 
illustrate how the fitness of the overexpression strains corre- 
lates with each characteristic separately. They show that al- 
though the effects of some factors (e.g., protein length) are 
small, they can be remarkably regular. In a formal statistical 
analysis, we used a linear model, which examined jointly all 
single factors and selected interactions (see Materials and 
Methods). The results are reported more thoroughly in sup- 
plementary table S2, Supplementary Material online. Here, in 
figure 3b, we present only summaries of statistics for individ- 
ual factors. Some factors, such as protein half-life, codon ad- 
aptation index, frequency of physical interactions, abundance 
under normal expression, energy of 5' mRNA fold, and gravy 
score proved nonsignificant. Two of the statistically significant 
factors, the presence of transmembrane regions and the pro- 
portion of protein length occupied by sequences predicted to 
be loosely shaped (intrinsically disordered), refer to properties 
that become meaningful only after a protein chain is synthe- 
sized and folded. Other properties may be important at the 
time of synthesis. There was a negative correlation between 
the level of mRNA under normal expression and fitness. This 
could mean that overexpression of the normally common 
transcripts tends to deplete optimal tRNAs for production of 
redundant proteins and thus slow down elongation of those 
needed. However, the effect of high CAI on fitness, although 
negative, was not statistically significant. The energy of the 
folding of 5' mRNAs was also neutral, suggesting that tran- 
scripts with rigid spatial structures did not trap too many ribo- 
somes (Plotkin and Kudla 2010). It thus appears that there is 
no shortage of ribosomes, and possibly optimal tRNAs, when 
1% of translation is useless, at least under the growth condi- 
tions applied here. Finally, there was a negative correlation 
between protein length and fitness indicating that the 
amount of an overproduced protein mattered (because all 
overexpressed proteins had the same promoter). This relation 
attracted our attention especially because it appeared to be 
very regular over the entire range of protein lengths (fig. 3a). 
We therefore decided to test experimentally whether the 
length of a protein is a good proxy for its amount under 
overexpression. 

Fig. 3. Protein properties and the fitness cost of overexpression. (a) Examples of fitness predictors (only the most significant predictors are shown; the 
remaining ones are in supplementary fig. S2, Supplementary Material online). Moving averages are shown as red lines for continuous variables. (b) Results of 
multifactorial analysis. Statistical significance of positive (green) and negative (red) effects is shown. 

Relating Fitness Cost to the Amount of Protein 

We estimated the cellular level of overproduced protein for a 
large sample of strains. Repeatability of estimates obtained 
by competitive ELISA was high (ICC = 0.944, n = 719, 
P ≪ 0.001), and the estimates centered on a median of 0.63% (fig. 4a). 
The relationship between the amount of overproduced pro- 
tein and its length is shown in figure 4b; Pearson's correlation 
coefficient was significant (r = 0.136, df = 717, P = 0.0002). 
To find a quantitative relation between the length of a protein 
and its amount under overexpression, we used a data set 
without the outliers seen in figure 4b (see supplementary 
methods, Supplementary Material online for details). We 
found that when the length of a protein doubles, its 
amount under overexpression increases by about one-half 
(the slope of a linear regression with both axes log-trans- 
formed was 0.47). We could then assign to every protein its 
expected amount under overexpression as a function of its 
length. From the common model of multiple regression, we 
found the relationships between the length of a protein (and 
its amount), the presence of transmembrane regions, and the 
presence of disordered regions, the three factors jointly affect- 
ing fitness (supplementary table S3, Supplementary Material 
online). This information is summarized in table 1, which lists 
the cost of expressing different proteins per 1% of total pro- 
tein mass and per amino acid. To get the latter estimates, we 
assumed that the total mass of proteins in the yeast cell is 
6.0 × 10⁻¹² g (Sherman 2002). Knowing the number of mol- 
ecules (Ghaemmaghami et al. 2003) and their molecular 
weights, we could calculate the total weight of every protein. 
The contribution of special regions was calculated from the 
proportions of the transmembrane or disordered regions cal- 
culated for every individual protein species (Persson and Argos 
1994; Krogh et al. 2001; Linding et al. 2003). One implicit 
assumption, which should introduce only minimal bias into our 
estimates, is that the per amino acid weight of 
the transmembrane, disordered, and other regions was equal 
(see supplementary methods [Supplementary Material online] 
for more details regarding calculations). 
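
The step from a cost per 1% of protein mass to a cost per amino acid can be followed with a back-of-the-envelope calculation. The sketch below is not the procedure of the supplementary methods: it assumes an average residue mass of 110 Da (a common rule of thumb that does not appear in the text) and uses the total protein mass of 6.0 × 10⁻¹² g together with the cost of 0.023 per 1% listed for standard proteins in table 1.

```python
AVOGADRO = 6.02214076e23      # residues per mole
TOTAL_PROTEIN_G = 6.0e-12     # total protein mass per cell, from the text
MEAN_RESIDUE_DA = 110.0       # assumed average mass of one amino acid residue

def cost_per_residue(cost_per_percent, percent=1.0):
    """Fitness cost attributable to a single residue in a single molecule."""
    mass_g = TOTAL_PROTEIN_G * percent / 100.0
    residues = mass_g / (MEAN_RESIDUE_DA / AVOGADRO)
    return cost_per_percent / residues

standard_cost = cost_per_residue(0.023)
print(f"{standard_cost:.2e}")
```

With these inputs the result is close to 7 × 10⁻¹¹, the same order as the per-residue value given for standard proteins in table 1; the remaining difference depends on the residue mass assumed.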

Table 1 shows that the average effect of having a disor- 
dered region or a transmembrane domain is remarkable but 
not excessively large. On average, disordered regions nearly 
doubled the fitness cost of the entire protein. Similarly, the 
membrane proteins were substantially more costly than were 
the cytosolic ones. The costs expressed per amino acid show 
the relative fitness changes of expanding some regions at the 
expense of other regions. They may also serve to compare 
fitness costs of proteins expressed at different levels. The 
yeast proteins are represented by very different numbers of 
molecules per cell under natural expression, from 10 to 1 
million (Ghaemmaghami et al. 2003). 
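
To see what the per-residue cost implies at natural expression levels, it can be multiplied by the number of molecules per cell and compared with the reciprocal of the effective population size. The sketch assumes that the cost scales linearly with copy number, uses 1/Ne as a rough boundary of effective neutrality, and takes Ne = 10⁷ as a round illustrative figure for yeast; neither the criterion nor the Ne value comes from the text.

```python
COST_PER_RESIDUE = 7.32e-11   # standard protein, per residue per molecule (table 1)
ASSUMED_NE = 1e7              # assumed effective population size, illustrative only

def selection_coefficient(molecules_per_cell, cost_per_residue=COST_PER_RESIDUE):
    """Fitness cost of one extra residue in every copy of a protein."""
    return molecules_per_cell * cost_per_residue

def effectively_neutral(s, ne=ASSUMED_NE):
    """Rough criterion: selection is weaker than drift when |s| < 1/Ne."""
    return abs(s) < 1.0 / ne

for copies in (10, 1_000, 1_000_000):
    s = selection_coefficient(copies)
    print(copies, f"{s:.2e}", effectively_neutral(s))
```

Under these assumptions an added residue is effectively neutral in a protein present at 10 or 1,000 copies but not in one present at a million copies.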

In the analyses described earlier, either some of the characteristics 
borrowed from other studies or our own measurements 
were lacking for a number of genes. We asked which 
of our results would hold if a single analysis were performed 
only for those genes for which the fitness estimate, the protein 
overexpression level, and all other variables were known. 
There were only 423 such genes. Detailed 
results are presented in supplementary table S4, Supplementary 
Material online. Briefly, the presence of transmembrane 
domains remained the most significant factor. Three factors 
pertaining to protein abundance (the measured level, the reported 
half-life, and the predicted length) were also significant 
or nearly significant. This latest finding is yet another 
indication that it is not only the structural properties of a 
redundant protein but also its amount that contributes to 
toxicity. 

Fig. 4. The level of protein overexpression. (a) Frequency distribution 
of the amount of protein at the normal (empty bars) and overexpressed 
(filled bars) levels. Normal protein levels were taken from a previous study 
(Ghaemmaghami et al. 2003) and overexpression estimates were obtained 
in this study using a competitive ELISA assay. (b) The relationship between 
protein length and protein overexpression level (see supplementary methods, 
Supplementary Material online). 

Table 1 
Fitness Cost of Protein Expression 

Protein (a)              Cost per protein (b)   Special region fraction   Cost per single aa (c) 
                         (Mean ± SE)            (Mean ± SD)               (Mean ± SE) 
Standard                 0.023 ± 0.005          -                         (7.32 ± 1.63) × 10⁻¹¹ 
Disordered (added)       0.017 ± 0.004          0.11 ± 0.08               (6.76 ± 1.47) × 10⁻¹⁰ 
Transmembrane (added)    0.012 ± 0.002          0.13 ± 0.10               (4.78 ± 0.82) × 10⁻¹⁰ 

(a) Proteins were standard (that is, cytosolic and well structured), contained 
disordered regions, or were located in membranes. The proportion of protein 
length taken by the disordered or transmembrane regions is shown in the middle 
column. 

(b) The fitness cost of producing 1% of superfluous polypeptide (standard), plus 
the costs added by the presence of disordered or transmembrane regions. 

(c) The fitness cost of expressing one amino acid in one protein molecule if the 
amino acid is located in standard or special regions. 

Discussion 

We found that overexpression of single genes in 
Saccharomyces cerevisiae generally leads to moderate but 
variable effects on growth. This variation is partly explained 
by the properties of the overexpressed protein molecules 
and the roles they play in cellular metabolism. Cell growth 
also correlated with the amount of overexpressed protein, 
indicating that synthesis and processing of useless polypeptides 
lowers the efficiency of cell growth. This particular cost 
was relatively small, which explains why it has not been 
convincingly demonstrated in former studies. Proteins with 
disordered or intramembrane regions were especially dam- 
aging to fitness when overexpressed. Based on these 
findings, we propose that an addition, or exchange, of a 
single amino acid is of little consequence for fitness unless 
it extends or creates protein regions forming critical 
structures. 

There are two possible explanations why the disordered 
and transmembrane regions are especially damaging to fitness 
when overexpressed. One of them concentrates on overload, 
the other on toxicity. Considering overload, we note that the 
summed mass of all membrane proteins is 15% of the total 
protein content in a yeast cell. Similarly, the disordered 
stretches of polypeptides make up approximately 12% of 
total protein. Therefore, the same weight of an extra 1% 
of protein constitutes a considerably higher overload in 
terms of proportion added to the proteins that are in mem- 
branes or are disordered. The costs associated with transmem- 
brane proteins can include membrane piercing, interfering 
with other membrane proteins, or engaging membrane- 
specific folding pathways. Similarly, if maintaining the total 
pool of loosely structured proteins poses some special cost 
to the cell, then every overexpressed member of this group 
adds a higher proportion to this cost. Generally, the costs of 
overload could result from expressing those proteins that are 
more expensive/risky to keep in the cell even if they function as 
expected. A type of overload hypothesis has been proposed in 
which malfunctioning of membranes occurs in response to 
the overexpression of a membrane protein (Eames and 
Kortemme 2012). In contrast, the cost of toxicity
means that overexpressed protein chains acquire new and 
unwanted functions. It is possible that both the disordered 
and membrane proteins are especially likely to undergo such 
transformation. The disordered or unstructured regions have 
important functions in signaling, control, and regulation 
(Dunker et al. 2008). Proteins with such regions interact 
with one another and with unrelated proteins, which 
leads to misfolding and aggregation (Uversky et al. 2008; 
Vavouri et al. 2009; Olzscha et al. 2011). Aggregates 
tend to expose hydrophobic surfaces and therefore tend to 
illegitimately penetrate and damage cellular membranes
(Kourie and Henry 2002; Stefani 2008). Even the programmed 
formation of transmembrane domains can be sensitive to 
crowding and nonprescribed interactions with other regions 
of polypeptides (Levine et al. 2005; Mackenzie 2006; Skach 
2009; Chakrabarti et al. 2011). In sum, there are good
hypothetical explanations why transmembrane and disordered
proteins are especially likely to be overloaded or driven into
toxicity when overexpressed. However, substantial efforts 
would be needed to find which of the two possible mecha- 
nisms is actually occurring when a particular protein is 
overexpressed. 
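
The proportional argument behind the overload hypothesis can be made concrete with a short calculation. The sketch below is only illustrative: it uses the pool sizes quoted above (15% for membrane proteins, 12% for disordered stretches) and assumes, for simplicity, that the whole extra 1% of protein is added to a single pool. The function name `relative_overload` is ours.

```python
# Illustrative sketch: how much does an extra 1% of total protein enlarge
# the pool it joins? Pool sizes are the fractions quoted in the text;
# assigning the whole extra 1% to one pool is a simplifying assumption.

def relative_overload(extra_fraction, pool_fraction):
    """Return the extra protein as a fraction of the pool it is added to."""
    if pool_fraction <= 0:
        raise ValueError("pool_fraction must be positive")
    return extra_fraction / pool_fraction

EXTRA = 0.01  # overexpressed protein, as a fraction of total protein

pools = {
    "all protein": 1.00,
    "membrane proteins": 0.15,
    "disordered stretches": 0.12,
}

for name, fraction in pools.items():
    increase = relative_overload(EXTRA, fraction)
    print(f"{name}: pool grows by {increase:.1%}")
```

Under these assumptions the same 1% of extra protein enlarges the membrane pool by roughly 7% and the disordered pool by roughly 8%, compared with 1% for bulk protein.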

There are two other properties of proteins that correlated 
with the cost of overexpression: the length of the polypeptide 
and the abundance of the cognate mRNA under normal ex- 
pression. As explained in the Results, we believe the two traits 
are simply correlated with the amount of useless protein and 
that this unnecessary burden is the real cause of fitness de- 
crease. We base our assumption on the remarkable regularity 
of the relationship between polypeptide length and fitness 
loss, as well as on a statistically significant relation between 
polypeptide length and the actual abundance of overexpressed
protein in the cell. We considered two alternative hypotheses. 
One assumes that long proteins are disproportionally more 
likely to misfold and thus overexploit molecular chaperones. 
To test this, we asked whether the overexpression of proteins 
known to interact with molecular chaperones had more sub- 
stantial effects on fitness. We do not report these tests be- 
cause we did not find any relationship between the fitness 
cost and the frequency of interactions with single chaperones 
(Bogumil et al. 2012), sets of chaperones revealed in large- 
scale studies (Gong et al. 2009), or smaller but carefully 
confirmed chaperone assemblages (Hartl et al. 2011). 
These results are in accord with a report suggesting that chap- 
erones are efficient enough to handle a load of misfolded 
proteins that is substantially higher than 1% (Vabulas and
Hartl 2005). Another alternative explanation, that long pro- 
teins have more domains and thus are more damaging to 
the cellular regulatory mechanisms, has been tested and re- 
jected (see Results). We therefore propose that our observed 
negative effect of protein length on fitness reflects the general 
cost of protein processing, which includes all expenses in- 
volved in protein synthesis, maturation, maintenance, and 
disposal. 

Our results can be used to address the question of 
whether natural selection is strong enough to prevent a 
single amino acid being added or exchanged for another 
one. The efficiency with which genomes and proteomes are 
purged of mutations depends not only on the strength of 
their effects but also on population size (Lynch and Conery 
2003; Fernandez and Lynch 2011). Natural selection operates
when $2N_e s > 1$, where $N_e$ stands for effective population
size and $s$ for the selection coefficient. It is effective
when the product is ten times higher. The effective
population size of a species closely related to S. cerevisiae,
S. paradoxus, was estimated at $8.6 \times 10^{6}$ (Tsai et al. 2008).
We found that the average cost of processing one amino
acid is approximately $7 \times 10^{-11}$ (table 1). This would be
the cost of adding one unnecessary amino acid to one
polypeptide, and it needs to be multiplied by the number of
affected molecules. It follows that to be nonneutral ($2N_e s > 1$),
a mutation of this type must hit a protein represented by
more than 830 molecules per cell. In S. cerevisiae, some
three-fourths of proteins meet this weaker criterion but 
only a small minority the stronger one (Ghaemmaghami 
et al. 2003). Thus, selection can possibly act on a single 
amino acid only if the effective population size is as large 
as in yeast and only if proteins are sufficiently abundant. 
The entire cost of this size would be at stake if an amino 
acid were to be deleted or inserted. Substitution would 
most likely still be less costly and thus more often neutral. 
In many organisms, the effective population size is much 
smaller, even by three orders of magnitude (Charlesworth 
2009; Gossmann et al. 2012), making selection still less
effective. Our empirical findings generally agree with the
results of a previous computational study. Expending single
atoms of the main components of yeast biomass (such as
carbon or nitrogen) has been found selectively nonneutral
for just approximately 1% of proteins (those most
abundantly expressed). Only under starvation for rarer elements,
such as sulfur, can a wasteful use of one atom (or of the amino
acid in which it resides) be significant for a substantial
proportion of proteins (Bragg and Wagner 2009).
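
The abundance threshold quoted in this paragraph follows directly from the neutrality criterion. The sketch below reproduces the arithmetic under the assumption stated in the text, namely that the per-residue cost is simply multiplied by the number of affected molecules. The function and variable names are ours, and the smaller population size in the last line is an arbitrary illustration of "three orders of magnitude" smaller, not a measured value.

```python
# Illustrative sketch of the neutrality threshold 2 * N_e * s > 1.
# The selection coefficient is taken as s = cost_per_residue * copies,
# i.e. the per-molecule cost scaled linearly by protein abundance
# (the assumption made in the text).

N_E = 8.6e6               # effective population size, S. paradoxus (Tsai et al. 2008)
COST_PER_RESIDUE = 7e-11  # cost of one amino acid in one molecule (table 1)

def min_copies(n_e, cost_per_residue, threshold=1.0):
    """Copies per cell above which 2 * n_e * s exceeds the threshold."""
    return threshold / (2.0 * n_e * cost_per_residue)

weak = min_copies(N_E, COST_PER_RESIDUE)                    # 2 * N_e * s > 1
strong = min_copies(N_E, COST_PER_RESIDUE, threshold=10.0)  # ten times higher
small_pop = min_copies(N_E / 1000.0, COST_PER_RESIDUE)      # hypothetical smaller N_e

print(f"weak criterion:   > {weak:.0f} molecules per cell")
print(f"strong criterion: > {strong:.0f} molecules per cell")
print(f"N_e / 1000:       > {small_pop:.0f} molecules per cell")
```

With the values from the text, the weak criterion gives a threshold just above 830 molecules per cell and the strong criterion one ten times higher. A population a thousand times smaller would push the weak threshold to roughly 830,000 molecules per cell.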

Considering the factors that could control the evolution of 
protein sequence, it is remarkable that the fitness costs asso- 
ciated with amino acids residing within the disordered or 
transmembrane regions were so much higher. It appears jus- 
tifiable to speculate that natural selection would operate most 
intensely on mutations creating new or extending existing re- 
gions of danger. Not only mutations making misfolding or 
misinteraction unavoidable would be selected against (Yang 
et al. 2012) but also any changes in the DNA sequence that 
could increase the rate of transcriptional and translational 
errors resulting in alterations of the spatial structure of pro- 
teins (Drummond et al. 2005; Drummond and Wilke 2008). 
Such changes could result in selection coefficients that were 
higher by several orders of magnitude than those arising from 
amino acid substitutions in standard protein regions. This is 
because any unwinding of a polypeptide can involve dozens of 
amino acids, each being ten times more costly than it was in a 
safe structure. There is some evidence to suggest that selec- 
tion preventing structural aberration can be strong (Chiti and 
Dobson 2006; Geiler-Samerotte et al. 2011), but further work
is clearly needed to show that much or perhaps most of the
variation in the rate of protein evolution can be attributed to
selection minimizing the danger of protein misfolding and
toxicity.
Supplementary Material

Supplementary methods, tables S1-S4, and figures S1 and S2 
are available at Genome Biology and Evolution online (http:// 
www.gbe.oxfordjournals.org/). 